Solar Power Plant Monitoring
Monitoring and identifying potential issues in a solar power plant
In this analysis, we are evaluating
the performance of a series of panels in a solar power plant. We want to
see whether any of the panels show issues that may indicate needed
maintenance, or whether any are performing sub-optimally.
The data comes from Kaggle.
As usual, we
begin by inspecting the data and checking for missing values:
## date_time plant_id source_key dc_power
## Length:68778 Min. :4135001 Length:68778 Min. : 0
## Class :character 1st Qu.:4135001 Class :character 1st Qu.: 0
## Mode :character Median :4135001 Mode :character Median : 429
## Mean :4135001 Mean : 3147
## 3rd Qu.:4135001 3rd Qu.: 6367
## Max. :4135001 Max. :14471
## ac_power daily_yield total_yield
## Min. : 0.00 Min. : 0 Min. :6183645
## 1st Qu.: 0.00 1st Qu.: 0 1st Qu.:6512003
## Median : 41.49 Median :2659 Median :7146685
## Mean : 307.80 Mean :3296 Mean :6978712
## 3rd Qu.: 623.62 3rd Qu.:6274 3rd Qu.:7268706
## Max. :1410.95 Max. :9163 Max. :7846821
## [1] 22
Depending on the time of day, power
generation varies widely. This means looking at averages across whole days
is not going to be informative. Means are also prone to outliers, so
we’ll focus on median power generation to get an overview of the
individual panels.
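To make the median-vs-mean contrast concrete, here is a minimal sketch of the per-panel aggregation, using a toy frame with the same columns assumed for powerData (source_key, ac_power); the values are made up for illustration:

```r
library(dplyr)

# Toy stand-in for powerData: two panels, one with a large outlier.
toy <- data.frame(
  source_key = rep(c("panelA", "panelB"), each = 4),
  ac_power   = c(0, 10, 20, 1000,   # panelA: one extreme reading
                 0, 10, 20, 30)     # panelB: no outlier
)

# Per-panel median and mean AC power.
panelStats <- toy %>%
  group_by(source_key) %>%
  summarise(medianPower = median(ac_power),
            meanPower   = mean(ac_power),
            .groups = "drop")

# panelA: median 15 but mean 257.5 -- the single outlier drags the
# mean far from the median, which is why we rank panels by median.
```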
We’re also dealing with 22 individual panels,
which can make visualizations a little tricky; they’re liable to get a
bit crowded. Case in point:
It isn’t horrible; I may use it to
portray those two panels that are clearly under-performing to a
non-technical group. But we can do better if we want to get more
information:
By plotting the mean against the
median, we can observe just how much outliers in the data affect each
panel, and how the large variance in the data skews the mean away from
the median. And down there in the left corner are our two potentially
problematic panels that we want to keep in mind, and dig into a
little later in this analysis. For now, we want to find any potential
issues that have arisen in the past ~month of data that we have.
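A mean-vs-median scatter like the one described could be built along these lines; the per-panel summary frame and its values here are made up for illustration:

```r
library(ggplot2)

# Toy per-panel summary standing in for the real aggregation of
# powerData: two panels deliberately sit well below the others.
panelStats <- data.frame(
  source_key  = paste0("panel", 1:5),
  medianPower = c(420, 415, 430, 120, 110),
  meanPower   = c(3100, 3050, 3200, 900, 850)
)

# One point per panel; under-performers cluster in the lower left.
p <- ggplot(panelStats, aes(x = medianPower, y = meanPower,
                            label = source_key)) +
  geom_point() +
  geom_text(vjust = -0.7, size = 3) +
  labs(x = "Median AC power", y = "Mean AC power",
       title = "Mean vs. median AC power per panel")
```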
#-----------Analyze power generation to identify issues------
aggDF <- powerData %>%
  mutate(hour = substr(date_time, 12, 13),        # hour of day, e.g. "09"
         hourMinute = substr(date_time, 12, 16),  # time of day, e.g. "09:15"
         day = as.Date(substr(date_time, 1, 10), format = '%d-%m-%Y')
  ) %>%
  group_by(source_key, hourMinute) %>%
  mutate(hourMinuteMean = mean(ac_power),  # per-panel mean at this time of day
         hourMinuteStd  = sd(ac_power)     # per-panel sd at this time of day
  ) %>%
  ungroup()
issues <- aggDF %>%
  filter(ac_power < (hourMinuteMean - 3 * hourMinuteStd)) %>%
  mutate(changeFromAverage = ((ac_power - hourMinuteMean) / hourMinuteMean) * 100)
length(unique(issues$source_key))
## [1] 19
I’m including the code here to make
it easier to follow the reasoning I went through. I took the date_time
column and did a little feature engineering, extracting the hour of
the day, the time of day, and the day itself out into their own columns. My
thought process here is this:
The time of day affects power
generation greatly, but the day itself should not affect power
generation. So I want to compare similar times of day together and
aggregate at that level, not the day level. In theory, I want to set
this up so it could become some kind of live-alerting system.
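That live-alerting idea could be sketched as a simple threshold check against the per-time-of-day baseline computed above; flagReading is a hypothetical helper, not code from this analysis:

```r
# Flag a new reading if it falls more than 3 standard deviations
# below the historical mean for that time of day. The baseline names
# mirror the hourMinuteMean / hourMinuteStd columns built in aggDF.
flagReading <- function(ac_power, hourMinuteMean, hourMinuteStd) {
  ac_power < (hourMinuteMean - 3 * hourMinuteStd)
}

flagReading(0,  hourMinuteMean = 101, hourMinuteStd = 20)  # TRUE: alert
flagReading(95, hourMinuteMean = 101, hourMinuteStd = 20)  # FALSE: normal
```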
I
then aggregate the average for each time of day; the average is okay in
this case since, as stated above, power production at a specific time
shouldn’t be widely different day to day. I also get the standard
deviation for each time of day to help identify outliers.
At
that point, I can filter for every reading that came in more than 3
standard deviations below the average power production for that specific
time of day. Looking at how many panels experienced some sort of issue,
we get 19, which is pretty much all of them. That seems a bit fishy to me,
so I break down the unique hours in which issues arose.
## [1] "09" "10" "12" "13" "11" "06"
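A breakdown like this can be produced from the engineered hour column; the issues frame below is a toy stand-in for the real one:

```r
# Toy issues frame carrying only the engineered hour column.
issues <- data.frame(
  hour = c("09", "10", "09", "12", "06")
)

# Unique hours in which any flagged reading occurred.
unique(issues$hour)
# [1] "09" "10" "12" "06"
```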
And there it is: a small timeframe
in which the oddities occur. My guess is that these hours are when
the sun is rising and when the sun is setting, so only a small
subsection of the panels are getting sunlight, and at various
intensities. The odd one out here is hour 06, so I dug into that a
little more:
## # A tibble: 2 Ă— 5
## date_time source_key ac_power hourMinuteMean changeFromAverage
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 17-06-2020 06:45 bvBOhCH3iADSZry 0 101. -100
## 2 17-06-2020 06:45 iCRJl6heRkivqQ3 0 107. -100
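Rows like these can be pulled with a simple filter on the engineered hour column; this toy frame just mirrors the relevant columns (the source_key values come from the table above):

```r
library(dplyr)

# Toy issues frame with the columns used by the filter below.
issues <- data.frame(
  hour       = c("09", "06", "06"),
  source_key = c("someOtherPanel", "bvBOhCH3iADSZry", "iCRJl6heRkivqQ3"),
  ac_power   = c(12, 0, 0)
)

# Restrict to the flagged readings from hour 06.
sixAM <- issues %>% filter(hour == "06")
nrow(sixAM)
# [1] 2
```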
Oh look! One of those is our little
sub-optimal friend from above. Seems there are a few interesting
indicators for that panel. Another panel had no power
production at the same time as well.
Seems like a good time to
look into sub-optimally performing panels.
I do this by defining a
panel as an under-performer if its median power production is lower than
the overall median power production. I then plot the number of these
under-performing panels over time.
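Under the simplifying assumption that the input carries one median reading per panel per day, the under-performer count over time could be sketched like this (toy data, hypothetical names):

```r
library(dplyr)

# Toy frame: one daily median reading per panel per day.
toy <- data.frame(
  day        = as.Date(rep(c("2020-05-20", "2020-06-10"), each = 3)),
  source_key = rep(c("panelA", "panelB", "panelC"), times = 2),
  ac_power   = c(100, 50, 90,    # May: all three below the overall median
                 300, 200, 400)  # June: all three above it
)

# A panel under-performs on a day if its reading sits below
# the overall median across all panels and days.
overallMedian <- median(toy$ac_power)  # 150 for this toy frame

underCount <- toy %>%
  group_by(day) %>%
  summarise(nUnder = sum(ac_power < overallMedian), .groups = "drop")
# May day: 3 under-performers; June day: 0.
```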
The result tells us that the number
of under-performing panels dropped by a pretty large amount in
June compared to May. Wins like this are always good to find. Builds
team morale (and makes the stakeholders happy. Mostly that.).
Alright, last thing I want to do is look at those two panels from
before. I want to know if they’ve always been under-performing, or if it
started at a specific date.
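The comparison could be set up along these lines: daily median power for the two suspect source_key values against the rest of the fleet. The frame and its values are made up; only the two suspect IDs come from the table earlier:

```r
library(dplyr)

suspects <- c("bvBOhCH3iADSZry", "iCRJl6heRkivqQ3")

# Toy frame: one daily median reading per panel per day.
toy <- data.frame(
  day        = as.Date(rep(c("2020-05-20", "2020-06-10"), each = 3)),
  source_key = rep(c("bvBOhCH3iADSZry", "panelB", "panelC"), times = 2),
  ac_power   = c(200, 500, 520,
                 210, 510, 530)
)

# Daily median for the suspect panel(s) vs. all other panels.
comparison <- toy %>%
  mutate(group = ifelse(source_key %in% suspects, "suspect", "others")) %>%
  group_by(day, group) %>%
  summarise(medianPower = median(ac_power), .groups = "drop")
# The suspect group sits well below the others on both days.
```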
Looks like they have been pretty
consistently under-performing compared to the average of the
others. That probably means it isn’t a matter of cleaning them off, but
rather that they could simply be faulty.
In a future analysis, I will
build out a model that uses the associated weather data to predict the
power generation.